This notebook documents a bottom-up strategy for determining notebook similarity. It is based on the notion that it is easier to aggregate than to break down a 'black box.'
The biggest challenge is working with the AST structure. Because an AST is a tree, we need to merge leaves with their parents, working our way up.
The goal is to come up with a similarity function for entire notebooks.
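To make the tree structure concrete before diving in, here is what Python's `ast` module produces for a one-line statement (a minimal illustration, not part of the original analysis):

```python
import ast

# Even a single assignment parses into several nested levels:
tree = ast.parse("x = 1 + 2")
print(ast.dump(tree.body[0]))
```

The top-level `Assign` node contains a `Name` target and a `BinOp` value, which in turn contains its own children; the bottom-up approach has to aggregate from those innermost nodes outward.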
In [1]:
# Necessary imports
import os
import time
from nbminer.notebook_miner import NotebookMiner
from nbminer.cells.cells import Cell
from nbminer.features.ast_features import ASTFeatures
from nbminer.stats.summary import Summary
from nbminer.stats.multiple_summary import MultipleSummary
In [2]:
# Loading in the notebooks
people = os.listdir('../testbed/Final')
notebooks = []
for person in people:
    person = os.path.join('../testbed/Final', person)
    if os.path.isdir(person):
        direc = os.listdir(person)
        notebooks.extend([os.path.join(person, filename) for filename in direc if filename.endswith('.ipynb')])
notebook_objs = [NotebookMiner(file) for file in notebooks]
a = ASTFeatures(notebook_objs)
a = ASTFeatures(notebook_objs)
In [3]:
# For each notebook, break the notebook up into top-level AST nodes
for i, nb in enumerate(a.nb_features):
    a.nb_features[i] = nb.get_new_notebook()
In [4]:
import networkx
from collections import deque
import ast
In [5]:
# Function that returns a networkx DiGraph built from a top-level AST node
def return_graph(node):
    dgraph = networkx.DiGraph()
    nodes = deque()
    nodes.append(node.body[0])
    dgraph.add_node(node.body[0])
    while len(nodes) != 0:
        cur_node = nodes.pop()
        for child in ast.iter_child_nodes(cur_node):
            dgraph.add_node(child)
            dgraph.add_edge(cur_node, child)
            nodes.append(child)
    return dgraph
# Function that returns a list of these graphs for every top-level node in every notebook
def return_all_graphs(a):
    graphs = []
    roots = []
    cells = []
    for i, nb in enumerate(a.nb_features):
        for cell in nb.get_all_cells():
            graphs.append(return_graph(cell.get_feature('ast')))
            roots.append(cell.get_feature('ast').body[0])
            cells.append(cell)
    return graphs, roots, cells
# Call to retrieve all graphs
all_graphs, all_roots, all_cells = return_all_graphs(a)
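As a sanity check of the construction above, the same logic can be run on a snippet parsed directly with `ast.parse` rather than a mined notebook cell (a hypothetical example, not part of the corpus):

```python
import ast
from collections import deque
import networkx

def return_graph(node):
    # Build a DiGraph from the first top-level statement of a parsed module,
    # mirroring the construction used on the notebook cells above.
    dgraph = networkx.DiGraph()
    nodes = deque([node.body[0]])
    dgraph.add_node(node.body[0])
    while nodes:
        cur_node = nodes.pop()
        for child in ast.iter_child_nodes(cur_node):
            dgraph.add_node(child)
            dgraph.add_edge(cur_node, child)
            nodes.append(child)
    return dgraph

g = return_graph(ast.parse("x = y + 1"))
print(g.number_of_nodes(), g.number_of_edges())  # a tree: edges = nodes - 1
```

Since every AST node has exactly one parent, the resulting DiGraph is a tree, which is why the depth calculations below are well-defined.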
In [6]:
len(all_graphs)
Out[6]:
In [7]:
# For each graph, compute the depth of its deepest node (longest path from the root)
max_values = []
for n in range(len(all_graphs)):
    max_values.append(max(networkx.shortest_path_length(all_graphs[n], all_roots[n]).values()))
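`shortest_path_length` with a source returns a mapping from each reachable node to its distance from that source, so taking the `max` gives the tree's depth. On a toy graph:

```python
import networkx

# Root 0 has children 1 and 3; node 2 hangs one level below node 1.
g = networkx.DiGraph([(0, 1), (1, 2), (0, 3)])
depths = networkx.shortest_path_length(g, 0)  # {node: distance from node 0}
print(max(depths.values()))  # prints 2: the deepest node is 2 edges from the root
```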
In [8]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(max_values, bins=20)
Out[8]:
In [ ]:
In [9]:
# Indices of the graphs sorted by depth, shallowest first
sorted_indices = [i[0] for i in sorted(enumerate(max_values), key=lambda x: x[1])]
In [10]:
# Inspect the deepest cell's AST
max_values[sorted_indices[-1]]
ast.dump(all_cells[sorted_indices[-1]].get_feature('ast'))
Out[10]:
In [11]:
# The original code that produced this AST
print(all_cells[sorted_indices[-1]].get_feature('original_code'))
In [12]:
print(all_cells[sorted_indices[-2]].get_feature('original_code'))
In [20]:
print(ast.dump(all_cells[sorted_indices[-3]].get_feature('ast')))
In [14]:
print(all_cells[sorted_indices[-4]].get_feature('original_code'))
In [15]:
print(all_cells[sorted_indices[-5]].get_feature('original_code'))
In order to work bottom-up, we need to look at the leaf nodes and their parents. This is made simpler by the fixed ordering of children that certain node types guarantee. Descriptions of each node type can be found at http://greentreesnakes.readthedocs.io/en/latest/nodes.html
Many node types always have the same number of children. However, some have a variable number -- Assign is a good example of this. All three of the assignments below are valid:
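The concrete examples did not survive this export, but snippets like the following illustrate the point (illustrative, not necessarily the originals): a single target, chained targets, and tuple unpacking all parse to an `Assign` node whose `targets` list differs in length and shape.

```python
import ast

# Three valid assignments whose Assign nodes have different child structure:
for src in ["a = 1", "a = b = 1", "a, b = 1, 2"]:
    assign = ast.parse(src).body[0]
    print(src, "->", len(assign.targets), "target(s)")
```

Chained assignment yields two `Name` targets, while tuple unpacking yields a single `Tuple` target, so the child count alone does not pin down the node's form.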
Let's look at both the size and form of a node's children to see if we can come up with a first pass at the bottom-up approach.
In [16]:
len(all_graphs), len(all_roots)
Out[16]:
In [17]:
# Record, for every node type, the set of child types observed at each occurrence
def traverse_graph(g, cur_d):
    for node in networkx.dfs_preorder_nodes(g):
        t_node = type(node)
        if t_node not in cur_d:
            cur_d[t_node] = []
        child_set = set()
        for child in g[node]:
            child_set.add(type(child))
        cur_d[t_node].append(child_set)
    return cur_d

my_dict = {}
for g in all_graphs:
    my_dict = traverse_graph(g, my_dict)
In [18]:
import numpy as np
# For each node type: the distinct child-set sizes observed, and the total number of occurrences
for key in my_dict.keys():
    print(key, np.unique(np.array([len(s) for s in my_dict[key]])), len(my_dict[key]))
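The same per-type summary can be reproduced on a single parsed statement to see what the collected sets look like (a toy illustration using `ast.walk` in place of the graph traversal, not the notebook corpus):

```python
import ast
from collections import defaultdict

# For every node in one statement, record the set of its child node types.
child_types = defaultdict(list)
for node in ast.walk(ast.parse("a, b = 1, 2").body[0]):
    child_types[type(node).__name__].append(
        {type(c).__name__ for c in ast.iter_child_nodes(node)})

# Distinct child-set sizes and occurrence counts per node type
for name, sets in sorted(child_types.items()):
    print(name, sorted({len(s) for s in sets}), len(sets))
```

Here `Assign` always has one child set of size 1 (a `Tuple` on each side collapses to one type), while the two `Tuple` nodes each mix element types with a context node, which is exactly the size/form variation the summary above is probing.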
In [ ]: